5.6.3 Token-Wise Clipping
Token-wise clipping then efficiently finds a suitable clipping range that minimizes the final quantization loss through a coarse-to-fine procedure. At the coarse-grained stage, by leveraging the fact that the less important outliers belong to only a few tokens, the authors propose to quickly obtain a preliminary clipping range in a token-wise manner. In particular, this stage aims to quickly skip over the region where clipping has little influence on accuracy. According to the second finding, the long-tail area corresponds to only a few tokens, so the maximum value of a token's embedding can serve as that token's representative; likewise, the minimum value can represent the negative outliers. Two new tensors with $T$ elements each can then be constructed by collecting the extreme values of every token:
$O_u = \{\max(\mathrm{token}_1), \max(\mathrm{token}_2), \ldots, \max(\mathrm{token}_T)\}$,
$O_l = \{\min(\mathrm{token}_1), \min(\mathrm{token}_2), \ldots, \min(\mathrm{token}_T)\}$,   (5.15)
where $O_u$ denotes the collection of upper bounds and $O_l$ the collection of lower bounds. The clipping values are determined by:
$c_u = \mathrm{quantile}(O_u, \alpha)$,
$c_l = \mathrm{quantile}(O_l, \alpha)$,   (5.16)
where quantile is the quantile function that computes the $\alpha$-th quantile of its input. The $\alpha$ that minimizes the final loss is found by grid search. The authors choose a uniform quantizer; thus, given the bit-width $b$, the step size $s_0$ of the uniform quantizer is computed from $c_u$ and $c_l$ as $s_0 = \frac{c_u - c_l}{2^b - 1}$.
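To make the coarse-grained stage concrete, the following sketch implements the grid search over $\alpha$ in PyTorch. It assumes activations of shape $(T, d)$, mirrors the lower quantile as $1 - \alpha$ so that a single $\alpha$ controls both tails, and uses the quantization mean-squared error as a stand-in for the final loss the authors actually minimize; these choices, the function name, and the $\alpha$ grid are illustrative assumptions rather than the original implementation.

```python
import torch

def coarse_token_wise_clipping(x, bit_width=8,
                               alphas=(0.9, 0.95, 0.98, 0.99, 0.995, 0.999, 1.0)):
    """Coarse-grained stage: token-wise grid search for the clipping range.

    x: activation tensor of shape (T, d) -- T tokens, d hidden units.
    Returns (alpha, c_l, c_u, s0) for the alpha with the lowest proxy loss.
    """
    o_u = x.max(dim=1).values              # O_u: one upper bound per token, Eq. (5.15)
    o_l = x.min(dim=1).values              # O_l: one lower bound per token
    n_levels = 2 ** bit_width - 1

    best = None
    for alpha in alphas:
        c_u = torch.quantile(o_u, alpha).item()        # Eq. (5.16)
        c_l = torch.quantile(o_l, 1.0 - alpha).item()  # mirrored quantile for the negative tail (assumption)
        s0 = (c_u - c_l) / n_levels                    # uniform-quantizer step size
        if s0 <= 0.0:
            continue
        # Fake-quantize with this candidate range; MSE is a proxy for the final loss.
        x_clip = x.clamp(c_l, c_u)
        x_dq = torch.round((x_clip - c_l) / s0) * s0 + c_l
        loss = torch.mean((x - x_dq) ** 2).item()
        if best is None or loss < best[0]:
            best = (loss, alpha, c_l, c_u, s0)
    return best[1:]
```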
At the fine-grained stage, the preliminary clipping range is optimized to obtain a better result. The aim is to make fine-grained adjustments in the critical area so as to further safeguard the final performance. In detail, the step size $s_0$ resulting from the coarse-grained stage is adopted as the initialization; the step size $s$ is then updated by gradient descent on the final loss $L$ with learning rate $\eta$:
$s = s - \eta \frac{\partial L}{\partial s}$.   (5.17)
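The gradient-based refinement of Eq. (5.17) can be sketched as follows, again assuming a PyTorch setting. A straight-through estimator passes gradients through the rounding operation, the quantization mean-squared error stands in for the final loss $L$, and the lower clipping value $c_l$ is kept fixed for simplicity; the function name and hyperparameters are illustrative.

```python
import torch

def refine_step_size(x, s0, c_l, bit_width=8, lr=1e-3, iters=100):
    """Fine-grained stage: refine the step size s by gradient descent, Eq. (5.17).

    x   : activation tensor to be quantized.
    s0  : step size from the coarse-grained stage (initialization).
    c_l : lower clipping value, kept fixed here for simplicity.
    """
    s = torch.tensor(float(s0), requires_grad=True)
    q_max = 2 ** bit_width - 1
    opt = torch.optim.SGD([s], lr=lr)

    for _ in range(iters):
        # Fake-quantize x with the current step size.
        q = torch.clamp((x - c_l) / s, 0, q_max)
        q = (q.round() - q).detach() + q      # straight-through estimator for round()
        x_dq = q * s + c_l
        # Quantization MSE stands in for the final loss L.
        loss = torch.mean((x - x_dq) ** 2)
        opt.zero_grad()
        loss.backward()                        # dL/ds
        opt.step()                             # s <- s - eta * dL/ds
    return s.detach().item()
```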
Because the wide-ranging outliers correspond to only a few tokens, traversing the unimportant area from the token perspective (the coarse-grained stage) requires far fewer iterations than traversing it from the value perspective (the fine-grained stage). The special design of the two stages fully exploits this property and thus leads to high efficiency.
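Putting the two sketches above together illustrates the coarse-to-fine flow on a toy tensor in which only a handful of tokens carry large outliers; the data and hyperparameters below are purely illustrative.

```python
import torch

# Toy activations: most values are small, two tokens carry large outliers.
x = torch.randn(256, 768) * 0.05
x[3] += 40.0
x[17] -= 35.0

# Coarse stage: a few quantile candidates give a preliminary range and s0.
alpha, c_l, c_u, s0 = coarse_token_wise_clipping(x, bit_width=6)

# Fine stage: gradient descent nudges the step size from s0 toward a lower loss.
s = refine_step_size(x, s0, c_l, bit_width=6, lr=1e-4, iters=200)
print(f"alpha={alpha}, s0={s0:.5f} -> s={s:.5f}")
```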
5.7 BinaryBERT: Pushing the Limit of BERT Quantization
Bai et al. [6] established the pioneering work on binary BERT pre-trained models. They first studied the potential reasons behind the sharp accuracy drop from ternarization to binarization of BERT, beginning with a comparison of the loss landscapes of full-precision, ternary, and binary BERT models. In detail, the parameters $W_1$ and $W_2$ from the value layers of multi-head attention in the first two transformer layers are perturbed as follows:
$\tilde{W}_1 = W_1 + x \cdot \mathbf{1}_x$,
$\tilde{W}_2 = W_2 + y \cdot \mathbf{1}_y$,   (5.18)